Method for object detection using knowledge distillation

ABSTRACT

A method that may include training a student ODNN to mimic a teacher ODNN. The training may include calculating a teacher student detection loss that is based on a pre-bounding-box output of the teacher ODNN. The pre-bounding-box output of the teacher ODNN is a function of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN. The method may also include detecting one or more objects in an image, by feeding the image to the trained student ODNN; outputting by the trained student ODNN a student pre-bounding-box output; and calculating one or more bounding boxes based on the student pre-bounding-box output.

BACKGROUND

Object detection is required in various systems and applications.

There is a growing need to provide a method and a system that may be able to provide highly accurate object detection at a low cost.

SUMMARY

There may be provided an object detection system that may include a trained student object detection neural network (ODNN) that was trained to mimic a teacher ODNN; wherein the training may include calculating a teacher student detection loss that may be based on a pre-bounding-box output of the teacher ODNN; wherein the pre-bounding-box output of the teacher ODNN may be a function of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN; wherein the trained student ODNN may be configured to receive an image and output a student pre-bounding-box output that may be indicative of one or more objects in the image; and a bounding box unit that may be configured to receive the student pre-bounding-box output and to calculate one or more bounding boxes based on the student pre-bounding-box output.

There may be provided a non-transitory computer readable medium that may store instructions for: training a student object detection neural network (ODNN) to mimic a teacher ODNN; wherein the training may include calculating a teacher student detection loss that may be based on a pre-bounding-box output of the teacher ODNN; wherein the pre-bounding-box output of the teacher ODNN may be a function of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN; and detecting one or more objects in an image; wherein the detecting may include: feeding the image to the trained student ODNN; outputting by the trained student ODNN a student pre-bounding-box output; and calculating one or more bounding boxes based on the student pre-bounding-box output.

There may be provided a method for object detection, the method may include training a student object detection neural network (ODNN) to mimic a teacher ODNN; wherein the training may include calculating a teacher student detection loss that may be based on a pre-bounding-box output of the teacher ODNN; wherein the pre-bounding-box output of the teacher ODNN may be a function of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN; and detecting one or more objects in an image; wherein the detecting may include: feeding the image to the trained student ODNN; outputting by the trained student ODNN a student pre-bounding-box output; and calculating one or more bounding boxes based on the student pre-bounding-box output.

The pre-bounding-box output of the teacher ODNN may be a weighted sum of the pre-bounding-box outputs of the different ODNNs; and wherein the method may include calculating, by the teacher ODNN, weights to be applied during a calculation of the weighted sum.

The method wherein each one of the pre-bounding-box outputs of the different ODNNs may include an objectiveness confidence level indicative of an existence of an object; and wherein the calculating of the weights may include applying a function on the objectiveness confidence levels of the pre-bounding-box outputs of the different ODNNs.

The function may be a softmax function.

The function may be a max function.

The function may be a sigmoid function.

The calculating of the weights may include training a weight learning neural network of the teacher ODNN.

The calculating of the weights may be done per anchor out of a set of anchors.

The method may include calculating the weighted sum of the pre-bounding-box outputs of the different ODNNs by applying different weights to different parts of the pre-bounding-box outputs of the different ODNNs.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the disclosure will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:

FIG. 1 illustrates an example of a method;

FIG. 2 illustrates an example of a teacher object detection neural network (ODNN) learning process;

FIG. 3 illustrates an example of a knowledge distillation process;

FIG. 4 illustrates an example of a knowledge distillation process;

FIG. 5 illustrates an example of an object detection process; and

FIG. 6 is an example of an image, different image segments, anchors and a YOLO compliant pre-bounding-box output.

DESCRIPTION OF EXAMPLE EMBODIMENTS

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

Because the illustrated embodiments of the present invention may, for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.

Any reference in the specification to a method should be applied mutatis mutandis to a device or system capable of executing the method and/or to a non-transitory computer readable medium that stores instructions for executing the method.

Any reference in the specification to a system or device should be applied mutatis mutandis to a method that may be executed by the system, and/or may be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions executable by the system.

Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a device or system capable of executing instructions stored in the non-transitory computer readable medium and/or may be applied mutatis mutandis to a method for executing the instructions.

Any combination of any module or unit listed in any of the figures, any part of the specification and/or any claims may be provided.

There may be provided a highly efficient system and method for object detection.

The highly efficient system combines the benefits of both knowledge distillation and ensembling.

Regarding the ensemble: the teacher object detection neural network (ODNN) includes multiple ODNNs that differ from each other (represent different models), and combining them generates a teacher ODNN that benefits from the contribution of the multiple ODNNs. This improves the performance of the teacher ODNN.

In order to enjoy the benefits of the ensemble, the outputs of the ODNNs should be raw or preliminary outputs, and not bounding boxes. These raw or preliminary outputs are also referred to as pre-bounding-box outputs.

The knowledge embedded in the teacher ODNN is distilled to a student ODNN that is much smaller (for example by a factor of at least 5, 10, 50, 100 and the like) to provide a student ODNN that mimics the teacher ODNN, but at a fraction of the cost, size, and power consumption of the teacher ODNN.

In order to benefit from the ensembling, the knowledge distillation is also based on these pre-bounding-box outputs.

In the following figures the teacher ODNN includes three ODNNs, whereas the number of ODNNs of the teacher ODNN may be two, or four or more.

FIG. 1 illustrates method 7300.

Method 7300 is for object detection.

Method 7300 may include steps 7310 and 7330.

Step 7310 may include training a student object detection neural network (ODNN) to mimic a teacher ODNN. The training may include calculating a teacher student detection loss that is based on a pre-bounding-box output of the teacher ODNN. The pre-bounding-box output of the teacher ODNN is a function of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN.

The pre-bounding-box output of the teacher ODNN may be a weighted sum of the pre-bounding-box outputs of the different ODNNs.

Step 7310 may include step 7312 of calculating, by the teacher ODNN, weights to be applied during a calculation of the weighted sum.

Each one of the pre-bounding-box outputs of the different ODNNs may include an objectiveness confidence level indicative of an existence of an object.

Step 7312 may include applying a function on the objectiveness confidence levels of the pre-bounding-box outputs of the different ODNNs.

The function may be, for example, a softmax function, a max function, a sigmoid function, and the like.
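
For illustration, a minimal Python sketch of how such a function may turn the per-ODNN objectiveness confidence levels for one anchor into ensemble weights; the function name and the normalization of the sigmoid variant are assumptions, not taken from this disclosure:

```python
import numpy as np

def objectness_weights(objectness_scores, mode="softmax"):
    """Turn per-ODNN objectiveness confidence levels (one anchor)
    into ensemble weights."""
    s = np.asarray(objectness_scores, dtype=np.float64)
    if mode == "softmax":
        e = np.exp(s - s.max())            # numerically stable softmax
        return e / e.sum()
    if mode == "max":
        w = np.zeros_like(s)
        w[s.argmax()] = 1.0                # winner-takes-all weighting
        return w
    if mode == "sigmoid":
        w = 1.0 / (1.0 + np.exp(-s))       # per-ODNN sigmoid
        return w / w.sum()                 # normalize so weights sum to 1
    raise ValueError(f"unknown mode: {mode}")

# Example: three member ODNNs scored one anchor at 2.0, 0.5 and -1.0.
print(objectness_weights([2.0, 0.5, -1.0], mode="softmax"))
```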

Step 7312 may include training a weight learning neural network of the teacher ODNN. The training is aimed to determine the weights of the weighted sum.
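
A minimal sketch of such a weight learning neural network, assuming PyTorch, three member ODNNs and a YOLO-like per-anchor vector of length 85 (4 coordinates, objectiveness, 80 classes); the layer sizes and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class WeightLearner(nn.Module):
    """Maps the concatenated member outputs to one weight per member ODNN."""
    def __init__(self, num_models: int, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_models * feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_models),
        )

    def forward(self, member_outputs: torch.Tensor) -> torch.Tensor:
        # member_outputs: (batch, num_models, feat_dim)
        flat = member_outputs.flatten(start_dim=1)
        return torch.softmax(self.net(flat), dim=-1)  # weights sum to 1

learner = WeightLearner(num_models=3, feat_dim=85)
outs = torch.randn(8, 3, 85)                     # batch of 8 anchors
weights = learner(outs)                          # (8, 3)
weighted_sum = (weights.unsqueeze(-1) * outs).sum(dim=1)  # (8, 85)
```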

The calculating of the weights can be done per anchor, per multiple anchors or per all anchors of a set of anchors. The set of anchors is usually a parameter of the object detection method.

The weighted sum may apply the same weights to the entire outputs of the ODNNs, but may also apply different weights to different parts of the pre-bounding-box outputs of the different ODNNs.
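
For illustration, a sketch that applies one weight set to the coordinate fields, another to the objectiveness field, and a third to the class fields; the field layout (x, y, h, w, objectiveness, classes) and all names are assumptions:

```python
import numpy as np

# Hypothetical layout of one pre-bounding-box vector:
# [x, y, h, w, objectiveness, class scores...]
COORDS, OBJ, CLS = slice(0, 4), slice(4, 5), slice(5, None)

def fieldwise_weighted_sum(member_outputs, w_coords, w_obj, w_cls):
    """Weighted sum with a different weight set per field group.
    member_outputs: (num_models, feat_dim); each w_*: (num_models,)."""
    out = np.empty(member_outputs.shape[1])
    for fields, w in ((COORDS, w_coords), (OBJ, w_obj), (CLS, w_cls)):
        out[fields] = np.tensordot(np.asarray(w), member_outputs[:, fields], axes=1)
    return out

members = np.random.rand(3, 85)   # three member ODNNs, one anchor each
fused = fieldwise_weighted_sum(members,
                               w_coords=[0.5, 0.3, 0.2],
                               w_obj=[1/3, 1/3, 1/3],
                               w_cls=[0.2, 0.4, 0.4])
```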

Step 7310 may be followed by step 7330 of detecting one or more objects in an image.

Step 7330 may include a sequence of steps that includes steps 7332, 7334 and 7336.

Step 7332 may include feeding the image to the trained student ODNN.

Step 7334 may include outputting by the trained student ODNN a student pre-bounding-box output.

Step 7336 may include calculating one or more bounding boxes based on the student pre-bounding-box output.

Step 7310 may include, for example, segmenting an image into regions, and within each region trying to find bounding boxes (of a set of bounding boxes) and the probabilities related to the inclusion of an object within the bounding boxes. Step 7310 may apply object detection algorithms such as but not limited to any one of the YOLO family (for example, YOLO, YOLO2, YOLO3, and the like).
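
To make the shapes concrete, a YOLO-style head over an S x S grid with A anchors per cell emits one vector per cell per anchor; the sizes below are common illustrative values, not taken from this disclosure:

```python
import numpy as np

S, A, NUM_CLASSES = 13, 3, 80      # illustrative YOLO-like sizes
FEAT = 5 + NUM_CLASSES             # x, y, h, w, objectiveness + classes

# Pre-bounding-box output for one image: here (13, 13, 3, 85).
pre_bbox_output = np.zeros((S, S, A, FEAT))

# The grid cell responsible for an object centered at (cx, cy) in a
# 416 x 416 pixel image:
cx, cy, image_size = 200.0, 120.0, 416
cell_col, cell_row = int(cx / image_size * S), int(cy / image_size * S)
```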

FIG. 2 illustrates an example of a teacher ODNN learning process which is a supervised learning process.

The teacher ODNN 700 includes three ODNNs (first ODNN 7011, second ODNN 7012, and third ODNN 7013) that differ from each other (represent different models).

The three ODNNs are fed, during the teacher ODNN learning process, with first training images TI1 7005 and output first, second and third ODNN output vectors OV1 7021, OV2 7022 and OV3 7023, respectively. These three output vectors are pre-bounding-box output vectors.

These three output vectors are sent to a weighted sum unit 7040 that calculates a weighted sum of the three output vectors. The weights used in the calculation of the weighted sum are calculated by a weight learning neural network 7030.

The weighted sum is the output vector (TOV 7042) of the teacher ODNN. During the learning process, output vector TOV 7042 and the expected outcome 7052 (the supervised learning process knows which objects appear in TI1 7005) are fed to a teacher object detection loss unit 7050 that calculates the errors/object detection loss and feeds the error to the weight learning neural network 7030, in order to allow the weight learning neural network 7030 to adjust itself and provide better weights.
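
A hedged sketch of one such supervised teacher training step (PyTorch-style; `members`, `weight_learner`, `detection_loss` and `optimizer` stand in for ODNNs 7011-7013, the weight learning neural network 7030, the teacher object detection loss unit 7050 and its feedback path, and all interfaces here are assumptions):

```python
import torch

def teacher_training_step(members, weight_learner, optimizer,
                          images, targets, detection_loss):
    """One supervised step of the teacher ODNN learning process (FIG. 2)."""
    outs = torch.stack([m(images) for m in members], dim=1)  # (B, M, D)
    weights = weight_learner(outs)                           # (B, M)
    tov = (weights.unsqueeze(-1) * outs).sum(dim=1)          # TOV 7042
    loss = detection_loss(tov, targets)                      # vs. outcome 7052
    optimizer.zero_grad()
    loss.backward()       # the error drives the weight learner's update
    optimizer.step()
    return loss.item()
```

The `weight_learner` could be, for example, the `WeightLearner` sketched earlier.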

It should be noted that the weight learning neural network 7030 may receive the entire first, second and third ODNN output vectors, or only parts of the first, second and third ODNN output vectors. For example, weights may be learned per anchor, per two or more anchors, and the like.

It should be noted that the calculating of the weighted sum may include applying the weights to the entire first, second and third ODNN output vectors, but that different weights may be applied to different parts of the first, second and third ODNN output vectors. For example, different fields of the first, second and third ODNN output vectors may be added using different weights.

FIG. 3 illustrates an example of a knowledge distillation process 7002. The knowledge distillation process may be executed after the completion of the teacher ODNN training process illustrated in FIG. 2.

The knowledge distillation process may include feeding second training images TI2 7007 to the teacher ODNN and to the student ODNN 7090 in order to train the student ODNN to mimic the teacher ODNN.

The knowledge distillation process is a supervised training process and the student ODNN 7090 is fed by a teacher student object detection loss (from a teacher student OD loss unit 7080) that reflects the difference between the outputs of the student and the teacher.

The student ODNN 7090 is also fed by its own object detection loss (difference from expected outcome 7054) calculated by student OD loss unit 7091.
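
A minimal sketch of the combined student objective, assuming a mean-squared mimic loss against the teacher's pre-bounding-box output and a blending factor `alpha`; both choices are illustrative, not prescribed by this disclosure:

```python
import torch
import torch.nn.functional as F

def combined_student_loss(student_out, teacher_out, targets,
                          own_loss_fn, alpha=0.5):
    # Teacher-student detection loss: distance between the student's and
    # the teacher's pre-bounding-box outputs (teacher is not updated).
    mimic = F.mse_loss(student_out, teacher_out.detach())
    # The student's own object detection loss vs. the expected outcome.
    own = own_loss_fn(student_out, targets)
    return alpha * mimic + (1.0 - alpha) * own
```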

FIG. 4 illustrates an example of a knowledge distillation process 7004 in which the teacher ODNN does not learn the weights by using a weight learning neural network 7030, but rather applies an objectiveness based weight calculation by an objectiveness based weight calculator 7031.

FIG. 5 illustrates an example of an object detection process 7006 that may occur after knowledge distillation process 7002 or knowledge distillation process 7004.

Third images TI3 7009 that should undergo an object detection process are fed to a student ODNN 7090 (after the student ODNN was trained) that outputs a pre-bounding-box output SOV 7092 that is fed to a bounding box unit 7110 for calculating bounding boxes.
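
A compact sketch of what such a bounding box unit may do with decoded candidates from SOV 7092, here greedy non-maximum suppression; the NMS choice and all interfaces are assumptions, not quoted from this disclosure:

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) top-left."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def bounding_box_unit(candidates, iou_threshold=0.5):
    """Greedy NMS over (box, score) candidates decoded from the
    student pre-bounding-box output."""
    kept = []
    for box, score in sorted(candidates, key=lambda c: -c[1]):
        if all(iou(box, k) < iou_threshold for (k, _) in kept):
            kept.append((box, score))
    return kept
```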

FIG. 6 is an example of an image 7200 that includes a person 7201 and a vehicle 7202. The image is segmented into segments and each segment is processed using the trained student ODNN.

FIG. 6 illustrates three anchors 7205. Each anchor defines the relationship between dimensions (such as a ratio between height and width) of potential bounding boxes.

FIG. 6 also shows an example of a pre-bounding-box output 7210 that may include coordinates (x, y, h, w) 7211, objectiveness 7212 and class 7213. The coordinates indicate the location (x, y) as well as the height and width, wherein x and y are coordinates (for example row and column), h is the height and w is the width. Objectiveness 7212 provides a confidence level that an object exists. Class 7213 is the class of the object (for example cat, dog, vehicle, or person).
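
For illustration, a common YOLO-style decoding of one such vector into a pixel-space bounding box; the sigmoid/exponential parameterization follows usual YOLO practice and is an assumption, not quoted from this disclosure:

```python
import numpy as np

def decode_box(pred, cell_xy, anchor_hw, grid_size, image_size):
    """Decode one pre-bounding-box vector
    [tx, ty, th, tw, objectiveness, class scores...] into a pixel box."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    cx = (cell_xy[0] + sig(pred[0])) / grid_size * image_size   # center x
    cy = (cell_xy[1] + sig(pred[1])) / grid_size * image_size   # center y
    h = anchor_hw[0] * np.exp(pred[2])                          # height
    w = anchor_hw[1] * np.exp(pred[3])                          # width
    objectiveness = sig(pred[4])
    cls = int(np.argmax(pred[5:]))                              # class index
    return (cx - w / 2, cy - h / 2, w, h), objectiveness, cls

pred = np.array([0.2, -0.1, 0.3, 0.1, 2.0] + [0.0] * 79 + [3.0])
box, obj, cls = decode_box(pred, cell_xy=(6, 4), anchor_hw=(90, 60),
                           grid_size=13, image_size=416)
```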

While the foregoing written description of the invention enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The invention should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the invention as claimed.

In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

Moreover, the terms “front,” “back,” “top,” “bottom,” “over,” “under” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

Furthermore, the terms “assert” or “set” and “negate” (or “deassert” or “clear”) are used herein when referring to the rendering of a signal, status bit, or similar apparatus into its logically true or logically false state, respectively. If the logically true state is a logic level one, the logically false state is a logic level zero. And if the logically true state is a logic level zero, the logically false state is a logic level one.

Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality.

Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.

Furthermore, those skilled in the art will recognize that boundaries between the above described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed in additional operations and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.

Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.

However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

It is appreciated that various features of the embodiments of the disclosure which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the embodiments of the disclosure which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.

It will be appreciated by persons skilled in the art that the embodiments of the disclosure are not limited by what has been particularly shown and described hereinabove. Rather the scope of the embodiments of the disclosure is defined by the appended claims and equivalents thereof.

What is claimed is:
1. A method for object detection, the method comprises: calculating, by a teacher object detection neural network (ODNN), weights to be applied during a calculation of a weighted sum; training a student ODNN to mimic the teacher ODNN; wherein the training comprises calculating a teacher student detection loss that is based on a pre-bounding-box output of the teacher ODNN that is the weighted sum of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN; and detecting one or more objects in an image; wherein the detecting comprises: feeding the image to the trained student ODNN; outputting by the trained student ODNN a student pre-bounding-box output; and calculating one or more bounding boxes based on the student pre-bounding-box output.
2. The method according to claim 1, wherein each one of the pre-bounding-box outputs of the different ODNNs comprises an objectiveness confidence level indicative of an existence of an object; and wherein the calculating of the weights comprises applying a function on the objectiveness confidence level of the pre-bounding-box outputs of the different ODNNs.
3. The method according to claim 2 wherein the function is a softmax function.
4. The method according to claim 2 wherein the function is a max function.
5. The method according to claim 2 wherein the function is a sigmoid function.
6. The method according to claim 1, wherein the calculating of the weights comprises training a weight learning neural network of the teacher ODNN.
7. The method according to claim 6 wherein the calculating of the weights is done per anchor out of a set of anchors.
8. The method according to claim 1, comprising calculating the weighted sum of the pre-bounding-box outputs of the different ODNNs by applying different weights to different parts of the pre-bounding-box outputs of the different ODNNs.
9. A non-transitory computer readable medium that stores instructions for: calculating, by a teacher object detection neural network (ODNN), weights to be applied during a calculation of a weighted sum; training a student ODNN to mimic the teacher ODNN; wherein the training comprises calculating a teacher student detection loss that is based on a pre-bounding-box output of the teacher ODNN that is the weighted sum of pre-bounding-box outputs of different ODNNs that belong to the teacher ODNN; and detecting one or more objects in an image; wherein the detecting comprises: feeding the image to the trained student ODNN; outputting by the trained student ODNN a student pre-bounding-box output; and calculating one or more bounding boxes based on the student pre-bounding-box output.
10. The non-transitory computer readable medium according to claim 9 wherein each one of the pre-bounding-box outputs of the different ODNNs comprises an objectiveness confidence level indicative of an existence of an object; and wherein the calculating of the weights comprises applying a function on the objectiveness confidence level of the pre-bounding-box outputs of the different ODNNs.
11. The non-transitory computer readable medium according to claim 10 wherein the function is a softmax function.
12. The non-transitory computer readable medium according to claim 10 wherein the function is a max function.
13. The non-transitory computer readable medium according to claim 10 wherein the function is a sigmoid function.
14. The non-transitory computer readable medium according to claim 9 wherein the calculating of the weights comprises training a weight learning neural network of the teacher ODNN.
15. The non-transitory computer readable medium according to claim 14 wherein the calculating of the weights is done per anchor out of a set of anchors.
16. The non-transitory computer readable medium according to claim 9 that stores instructions for calculating the weighted sum of the pre-bounding-box outputs of the different ODNNs by applying different weights to different parts of the pre-bounding-box outputs of the different ODNNs.